
Run queries through ad-hoc SSH tunnels #4797

Merged: 9 commits merged into master from auto-ssh-tunnels on May 11, 2020
Conversation

rauchy (Contributor) commented Apr 12, 2020

What type of PR is this? (check all applicable)

  • Feature

Description

Redash requires network access to the data sources it needs to query, but sometimes these data sources aren't publicly available. A way around this is to use SSH tunnels, which can be set up manually outside Redash, but can also be set up automatically before a connection is made and closed automatically once they are no longer required.

This PR sets up ad-hoc SSH tunnels for data sources that have an ssh_tunnel dict in their options, which must include the ssh_host and ssh_username of the bastion server to connect through.

In order to enable SSH tunnels, dynamic_settings.ssh_tunnel_auth() must be implemented to allow the Redash instance to authenticate with the bastion server(s).
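
To make this concrete, here is a minimal sketch of what the configuration could look like. The ssh_tunnel, ssh_host and ssh_username keys come from this PR; the exact keyword arguments returned by ssh_tunnel_auth() (for example a private key path) are an assumption based on typical sshtunnel usage and may differ from the final implementation.

# Hypothetical data source options enabling an ad-hoc SSH tunnel:
options = {
    "host": "internal-db.example.com",  # only reachable from the bastion's network
    "port": 5432,
    "ssh_tunnel": {
        "ssh_host": "bastion.example.com",   # bastion server to connect through
        "ssh_username": "redash",
    },
}

# Sketch of a dynamic_settings override; the returned dict is assumed to hold
# authentication kwargs for the tunnel library (e.g. a private key path).
def ssh_tunnel_auth():
    return {"ssh_pkey": "/opt/redash/keys/bastion_rsa"}  # assumed key-based auth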

Related Issues

#2013

rauchy requested a review from arikfr on April 12, 2020 14:28
Comment on lines 76 to 102
@property
def host(self):
    if "host" in self.configuration:
        return self.configuration["host"]
    else:
        raise NotImplementedError()

@host.setter
def host(self, host):
    if "host" in self.configuration:
        self.configuration["host"] = host
    else:
        raise NotImplementedError()

@property
def port(self):
    if "port" in self.configuration:
        return self.configuration["port"]
    else:
        raise NotImplementedError()

@port.setter
def port(self, port):
    if "port" in self.configuration:
        self.configuration["port"] = port
    else:
        raise NotImplementedError()
Member:

Let's add a comment explaining the purpose of these properties.

Contributor Author:

👍 8da215b

rauchy (Contributor) commented Apr 29, 2020

@arikfr forced schema refreshes can't be handled by gunicorn, so I offloaded them to RQ and polled for their results using the /jobs endpoint. This seems to work well. If you think the implementation is on track, I'll use the same method for connection testing.
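
For illustration, the polling flow described here could look roughly like the sketch below. It assumes (this PR does not spell out the exact payloads) that the schema endpoint returns a job descriptor when a refresh is forced and that /api/jobs/<id> exposes the job's status and result; the status codes and the max_attempts cap are illustrative.

import time
import requests

BASE = "https://redash.example.com/api"
HEADERS = {"Authorization": "Key <api-key>"}

def poll_schema_refresh(data_source_id, max_attempts=30, interval=1.0):
    # Force a schema refresh; assumed to enqueue an RQ job and return its id.
    resp = requests.get(f"{BASE}/data_sources/{data_source_id}/schema",
                        params={"refresh": "true"}, headers=HEADERS)
    job = resp.json()["job"]

    # Poll the /jobs endpoint until the job finishes or we give up.
    for _ in range(max_attempts):
        job = requests.get(f"{BASE}/jobs/{job['id']}", headers=HEADERS).json()["job"]
        if job["status"] in (3, 4):  # assumed SUCCESS / FAILURE codes
            return job.get("result")
        time.sleep(interval)
    raise TimeoutError("Schema refresh job did not finish in time")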

return True


@job("schemas", queue_class=Queue, at_front=True, timeout=30, ttl=90)
Member:

Unfortunately, for some data sources the timeout needs to be much longer. Let's start with 5 minutes.

Comment on lines +305 to 306
"result": result,
"query_result_id": query_result_id,
Member:

To avoid accumulating tech debt, maybe we can return query_result_id only when it's an execute_query task?

Member:

(only if it's a simple check)

Contributor Author:

If we want to avoid accumulating tech debt, we should probably always return result and ditch query_result_id, no?

Member:

But this will break scripts that use /jobs for polling and expect query_result_id... :| We could really use API versioning at this point.

Contributor Author:

I agree, that's why I went for duplicating result and query_result_id, as the cheapest thing that won't break and won't be too awkward :(

Comment on lines -189 to -195
except NotSupported:
    response["error"] = {
        "code": 1,
        "message": "Data source type does not support retrieving schema",
    }
except Exception:
    response["error"] = {"code": 2, "message": "Error retrieving schema."}
Member:

We need to keep reporting these error codes. For example, we might show a different UI when it's not supported. And we don't want to throw random errors at the user but rather show "Error retrieving schema." (it might make sense to show some of the errors, but it's beyond the scope of this PR).

Contributor Author:

I thought so too, but then I realized we aren't really using the error codes defined in the service. The current implementation (returning [] on NotSupported from the backend) mimics the behavior we have today: failing silently on NotSupported and showing a "Schema refresh failed." message when other errors occur.

Member:

I know that we don't really use them anymore in code -- this is something that happened during the conversion to React, but we should someday. It's better to be explicit about this.

arikfr merged commit 9562718 into master on May 11, 2020
arikfr deleted the auto-ssh-tunnels branch on May 11, 2020 10:22
jezdez (Member) left a comment

@arikfr @rauchy I have a few detailed questions about the job-based schema fetching, as I think this adds some risks regarding performance and user experience.

Apologies for not adding those comments before merging; I thought doing it here rather than in a new ticket gives more context, but please let me know if you want me to open one anyway.

return True


@job("schemas", queue_class=Queue, at_front=True, timeout=300, ttl=90)
Member:

@arikfr @rauchy I think this job adds a big risk of filling up Redis when a lot of people try to load a query or refresh the schema, since it stores the job result for the default 500 seconds and data source schemas can be quite big depending on the number of tables.

Since the result is effectively used only once when loaded, I would suggest a small TTL of 30 seconds and making it a configurable number. WDYT?
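
A sketch of that suggestion, reusing the decorator from the quoted snippet and adding RQ's standard result_ttl keyword. SCHEMAS_REFRESH_RESULT_TTL is a hypothetical, environment-configurable setting (defaulting to the 30 seconds proposed above), and the job name is illustrative:

import os

SCHEMAS_REFRESH_RESULT_TTL = int(os.environ.get("REDASH_SCHEMAS_REFRESH_RESULT_TTL", 30))

@job("schemas", queue_class=Queue, at_front=True, timeout=300, ttl=90,
     result_ttl=SCHEMAS_REFRESH_RESULT_TTL)
def refresh_schema(data_source_id):
    ...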

function getSchema(dataSource, refresh = undefined) {
  if (!dataSource) {
    return Promise.resolve([]);
  }

  const fetchSchemaFromJob = (data) => {
    return sleep(1000).then(() => {
jezdez (Member) commented May 13, 2020

This adds a noticeable delay to loading the schema browser without any indication that it's loading the schema. Could we add a "Loading... please wait" screen or something similar? Or reduce the sleep interval to something closer to an unnoticeable perception level?

Also, in case of a worker failure that prevents the job from ever returning a good result (e.g. Redis key eviction), could this add an upper limit on the number of tries to load the data and then show a button to fetch the schema anew with a new job?

jezdez (Member) commented May 13, 2020

Oh, one other comment: Marina and I spent a lot of time on a custom schema drawer feature in our fork (it was also merged here a while ago but was backed out after feedback from Arik): mozilla@109eac8

Since this PR adds a background job to load the schema, I want to point out that our schema drawer feature modified the refresh_schema job to store the schema in Redash's Postgres database in order to

  1. amend the schema with metadata such as descriptions, column samples and example queries, and
  2. cache the response for the schema drawer API endpoint with a Redis-lock-based refresh system to prevent the thundering-herd problem of the refresh-schema button.

The results are lower traffic to cost-intensive data sources (less $$$), immediate responses to the user, and the ability to provide more details about individual data source tables and columns.
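
For context, the lock-guarded refresh described above might look roughly like this. This is an illustrative sketch, not the fork's actual code: it caches in Redis rather than Postgres and uses redis-py's built-in lock so that only one worker refreshes a given data source's schema at a time, while other callers fall back to the cached copy.

import json
from redis import Redis

redis_client = Redis()

def refresh_schema_guarded(data_source, cache_ttl=3600):
    cache_key = f"schema:{data_source.id}"
    lock = redis_client.lock(f"schema_refresh_lock:{data_source.id}", timeout=300)

    # If another worker already holds the lock, serve the cached copy instead of
    # hitting the data source again: the thundering herd collapses into one refresh.
    if not lock.acquire(blocking=False):
        cached = redis_client.get(cache_key)
        return json.loads(cached) if cached else []

    try:
        schema = data_source.query_runner.get_schema()  # assumed query runner API
        redis_client.setex(cache_key, cache_ttl, json.dumps(schema))
        return schema
    finally:
        lock.release()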

If you'd like to get a quick demo again, please let me know; we're obviously eager to stay aligned with you.

jezdez (Member) commented May 29, 2020

@arikfr @rauchy We're about to start our rebase downstream to pull in these changes, and I was wondering if you had seen my comments above about the anti-patterns in this feature around loading the schema with a noticeable 1-second delay and the risk of race conditions if an RQ worker doesn't process the get_schema job?

arikfr (Member) commented May 31, 2020

We need to do a follow-up on this:

  1. Make the API return the cached version immediately if it's available (this will address the 1-second delay).
  2. Add some progress indicator (nice to have though, as we had the same issue in the past).

Adding a lock could be nice, but I'm not sure it's needed once we move the caching functionality to the API layer. Basically, at that point it ends up being the same behavior as before (in terms of the number of restarts).
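
A rough sketch of item 1 above: return a cached copy immediately when one exists, otherwise fall back to enqueueing the job and letting the client poll. The cache helper and response shapes are illustrative assumptions, not the eventual implementation.

import json
from redis import Redis

redis_client = Redis()

def cached_schema(data_source):
    # Hypothetical cache lookup; this could just as well be backed by Postgres.
    cached = redis_client.get(f"schema:{data_source.id}")
    return json.loads(cached) if cached else None

def get_data_source_schema(data_source, refresh=False):
    schema = cached_schema(data_source)
    if schema is not None and not refresh:
        return {"schema": schema}  # immediate response: no 1-second polling delay

    # Otherwise enqueue the refresh job (refresh_schema here refers to the RQ job,
    # named illustratively) and hand back a job id for the client to poll via
    # /api/jobs/<id>, as it does today.
    job = refresh_schema.delay(data_source.id)
    return {"job": {"id": job.id}}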
